DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
A snapshot of Wordle guess distributions from David and his Wordle-obsessed friends.
| Variables | |
|---|---|
| Count | An integer denoting the frequency of Guesses |
| Initials | A factor denoting whose Wordle guess distribution it is with 5 levels |
| Guesses | A factor denoting how many guesses it took to complete the daily Wordle (as you lose if your 6th guess is incorrect) with 7 levels |
If the population proportion, \(p\), is known (the ground “truth”, or parameter, that summarises all possible values we could observe)
The sampling distribution of the sample proportion, \(\hat{p}\), is
\[ \hat{p} ~ \text{approx.} ~ \text{Normal} \! \left(\mu_{\hat{p}} = p, \sigma_{\hat{p}} = \sqrt{\frac{p\times(1-p)}{n}} \right) \]
The use of the \(\hat{p}\) subscripts is to make it clear that we are talking about the sampling distribution of \(\hat{p}\) and not the possible values we could observe
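A small simulation can make the sampling distribution of \(\hat{p}\) concrete. The sketch below is illustrative only: the values of `p`, `n`, `n_sims` and the seed are assumptions chosen for the example, not values from the course data.

```r
# Simulate many samples and compare the sample proportions with the
# Normal approximation. All inputs here are assumed, illustrative values.
set.seed(121)        # arbitrary seed so the sketch is reproducible
p <- 0.5             # assumed "known" population proportion
n <- 1060            # assumed number of observations per sample
n_sims <- 10000      # number of simulated samples

# Each simulated sample yields one sample proportion
p_hats <- rbinom(n_sims, size = n, prob = p) / n

# Mean and standard deviation suggested by the Normal approximation
theory_mean <- p
theory_sd <- sqrt(p * (1 - p) / n)

# The simulated values should sit close to the theoretical ones
c(simulated = mean(p_hats), theory = theory_mean)
c(simulated = sd(p_hats), theory = theory_sd)
```

A histogram of `p_hats` (for example, `hist(p_hats)`) should look roughly bell-shaped and centred on `p`.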
More on 2. & 3.
These heuristics are a consequence of relying only on the sampling distribution of \(\hat{p}\)
The standard error of the sample proportion, \(\hat{p}\), is
\[ \text{se}(\hat{p}) = \sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}} \]
where \(\hat{p}\) is the sample proportion and \(n\) is the number of observations
Thought Question
If we hold \(n\) constant, what value of \(\hat{p}\) maximises its standard error?
This particular fact, alongside the \(z\)-multiplier (see Slide 12), is often used by polling companies to quantify the uncertainty of their data, and sometimes incorrectly
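One way to explore the thought question numerically is to evaluate the standard error over a grid of possible \(\hat{p}\) values. This is a minimal sketch: the helper `se_prop()` and the choice of `n = 1000` are assumptions made for illustration.

```r
# Hypothetical helper: standard error of a sample proportion
se_prop <- function(p_hat, n) sqrt(p_hat * (1 - p_hat) / n)

n <- 1000                          # assumed, fixed number of observations
p_grid <- seq(0, 1, by = 0.01)     # candidate values of p-hat
se_grid <- se_prop(p_grid, n)

# The value of p-hat on the grid with the largest standard error
p_max <- p_grid[which.max(se_grid)]
p_max

# The corresponding "worst case" margin of error with a 95% z-multiplier
max_margin <- qnorm(0.975) * max(se_grid)
max_margin
```

Plotting `se_grid` against `p_grid` shows how quickly the standard error shrinks as \(\hat{p}\) moves towards 0 or 1.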
A survey of 1060 randomly selected US teens ages 13 to 17 found that 605 of them say they have made a new friend online.
It was of interest to infer the population proportion of all US teens who have made a new friend online using this data. Furthermore, use this data to test whether more than 50% of all US teens have made a new friend online.
Made a new friend online?
# Data for proportions are commonly summarised by groups
library(lattice)  # provides barchart()
data <- c(605, 455); groups <- c("Yes", "No")
barchart(data ~ groups, origin = 0,
         xlab = "Made a new friend online?", ylab = "Frequency",
         main = "Distribution of survey responses")
Are all three assumptions met?
(Because we are also conducting a hypothesis test for CS 7.1)
\(\text{se}(\hat{p}) = 0.0152028\), \(\text{se}(p_0) = 0.0153574\)
\[ \hat{p} \pm z^*_{1-\alpha/2} \times \text{se}(\hat{p}) \]
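where \(z^*_{1-\alpha/2}\) is the \(z\)-multiplier, the \(1-\alpha/2\) quantile of the Standard Normal distribution, for a \((1-\alpha) \times 100\%\) confidence level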
Recall the following assumption for inference on \(p\)
The following heuristic has to be met: \(n \times \hat{p}\) and \(n \times (1 - \hat{p})\) are greater than or equal to 10
The theoretical justification for this arises from the fact that this method of constructing a confidence interval with a \(z\)-multiplier works “most” of the time without specifying a formal statistical model for the data
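The heuristic is simple to check in R. The sketch below uses the CS 7.1 counts; the helper name `meets_heuristic()` is hypothetical and not part of any package. It also applies the same check to \(p_0 = 0.5\), since CS 7.1 involves a hypothesis test as well.

```r
# Hypothetical helper: are both expected counts at least 10?
meets_heuristic <- function(n, p) {
  (n * p >= 10) && (n * (1 - p) >= 10)
}

x <- 605       # number of teens who said "Yes" (CS 7.1)
n <- 1060      # number of teens surveyed (CS 7.1)
p_hat <- x / n
p_0 <- 0.5     # hypothesised value used in the CS 7.1 test

# Check for the confidence interval (uses p-hat)
ci_ok <- meets_heuristic(n, p_hat)

# Check for the hypothesis test (uses p_0)
test_ok <- meets_heuristic(n, p_0)

c(confidence_interval = ci_ok, hypothesis_test = test_ok)
```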
Recall that 605 out of the 1060 US teens said they made a new friend online. Construct a 95% confidence interval for the population proportion of all US teens who have made a new friend online \[ \hat{p} = \frac{605}{1060}, \quad \text{se}(\hat{p}) = 0.0152 \]
The solution is (0.5409577, 0.6005517)
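The solution can be reproduced by applying the confidence interval formula step by step. This is a sketch of one way to do it; the object names are chosen for illustration.

```r
x <- 605       # number of teens who said "Yes"
n <- 1060      # number of teens surveyed

p_hat <- x / n                                # sample proportion
se_p_hat <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of p-hat
z_star <- qnorm(1 - 0.05 / 2)                 # z-multiplier for 95% confidence

# Lower and upper limits: p-hat -/+ z-multiplier x se(p-hat)
ci <- p_hat + c(-1, 1) * z_star * se_p_hat
ci
```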
1-sample proportions test without continuity correction
data: 605 out of 1060, null probability 0.5
X-squared = 21.226, df = 1, p-value = 4.081e-06
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5407550 0.6002435
sample estimates:
p
0.5707547
For CS 7.1, the 95% confidence interval for the population proportion of all US teens who have made a new friend online was (0.5409577, 0.6005517)
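Note that the interval printed by `prop.test()` differs slightly from the hand-calculated one. According to R's documentation, `prop.test()` computes its interval by inverting the score test (the Wilson score interval) rather than using \(\hat{p} \pm z^* \times \text{se}(\hat{p})\). The sketch below simply puts the two intervals side by side; the object names are illustrative.

```r
x <- 605
n <- 1060
p_hat <- x / n
z_star <- qnorm(0.975)

# Interval from the formula used in these notes
by_hand <- p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)

# Interval reported by prop.test() without continuity correction
from_r <- as.numeric(
  prop.test(x = x, n = n, p = 0.5, correct = FALSE)$conf.int
)

# Compare the limits row by row
rbind(by_hand, from_r)
```

With a sample this large the two intervals agree to about three decimal places, so the interpretation is unchanged.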
The method covered in DATAX121 is known as a z-test for a proportion
The \(\text{se}(\hat{p})\) is defined solely in terms of the observed statistic and the number of observations
Could we use the Normal approximation from Slide 6 directly for the hypothesis test? Yes, we can!
This is because we want to see if \(\hat{p}\) is in an unusual place of a distribution we would expect to see if \(H_0\) is true
Therefore, the test statistic for \(p\) in DATAX121 instead uses the standard error of the hypothesised value of the population (underlying) proportion \(p_0\)
\[ \text{se}(p_0) = \sqrt{\frac{p_0\times(1-p_0)}{n}} \]
\[ z_0 = \frac{\hat{p} - p_0}{\text{se}(p_0)} \]
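where \(\hat{p}\) is the sample proportion, \(p_0\) is the hypothesised value of \(p\), and \(n\) is the number of observations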
Let \(Z\) be the Standard Normal distribution
If it is a two-sided test, e.g. \(H_1 \! : p \neq p_0\)
\(\quad p\text{-value} = 2 \times \mathbb{P}(Z > |z_0|)\)
If it is a one-sided test and \(H_1 \! : p > p_0\)
\(\quad p\text{-value} = \mathbb{P}(Z > z_0)\)
If it is a one-sided test and \(H_1 \! : p < p_0\)
\(\quad p\text{-value} = \mathbb{P}(Z < z_0)\)
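The three cases above translate directly into calls to `pnorm()`. The function below is a minimal sketch; its name, `p_value_z()`, and its argument labels are assumptions made for this example (the labels mirror those used by `prop.test()`).

```r
# Hypothetical helper: p-value for a z test statistic
p_value_z <- function(z_0, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         # H1: p != p_0, twice the upper tail area beyond |z_0|
         two.sided = 2 * pnorm(abs(z_0), lower.tail = FALSE),
         # H1: p > p_0, upper tail area
         greater = pnorm(z_0, lower.tail = FALSE),
         # H1: p < p_0, lower tail area
         less = pnorm(z_0))
}

# Usage with an illustrative test statistic
p_value_z(1.96, alternative = "two.sided")
p_value_z(1.96, alternative = "greater")
p_value_z(1.96, alternative = "less")
```

For the same \(z_0\), the two one-sided p-values always sum to 1.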
Recall that 605 out of the 1060 US teens said they made a new friend online. Use this data to test whether more than 50% of all US teens have made a new friend online. \[ \hat{p} = \frac{605}{1060}, \quad \text{se}(\hat{p}) = 0.0152 \]
Hypothesis statements
\(\quad H_0\!: p = 0.5\)
\(\quad H_1\!: p > 0.5\)
Lastly, we need \(\text{se}(p_0)\)
\(\text{se}(p_0) = 0.0153574\)
\(z_0 = 4.6072134\)
\(z_0^2 = 21.2264151\)
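The three values above can be reproduced by hand in R. This is a sketch with illustrative object names; it also computes the one-sided p-value for \(H_1\!: p > 0.5\).

```r
x <- 605       # number of teens who said "Yes"
n <- 1060      # number of teens surveyed
p_0 <- 0.5     # hypothesised value of p

p_hat <- x / n                            # sample proportion
se_p_0 <- sqrt(p_0 * (1 - p_0) / n)       # standard error under H0
z_0 <- (p_hat - p_0) / se_p_0             # test statistic

# Upper tail area because H1: p > p_0
p_value <- pnorm(z_0, lower.tail = FALSE)

c(se_p_0 = se_p_0, z_0 = z_0, z_0_squared = z_0^2, p_value = p_value)
```

The squared test statistic, `z_0^2`, matches the `X-squared` value that `prop.test()` reports when `correct = FALSE`.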
# The R function to calculate it in one go
prop.test(x = 605, n = 1060, p = 0.5,
alternative = "greater", correct = FALSE)
1-sample proportions test without continuity correction
data: 605 out of 1060, null probability 0.5
X-squared = 21.226, df = 1, p-value = 2.041e-06
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
0.5455993 1.0000000
sample estimates:
p
0.5707547
For CS 7.1, the exact p-value for the appropriate set of hypothesis statements was \(2.0405 \times 10^{-6}\)
Hull, J. D. (1994, December 26). Tale of One Parish. Time, 144(26), 74–76.
Time magazine reported that in a 1994 survey of 507 randomly selected adult American Catholics, 59% answered yes to the question “Do you favour allowing women to be priests?”
Does this data indicate that the majority of all adult American Catholics are in favour?
| Variables | |
|---|---|
| Answer | A factor denoting whether a survey respondent answered either yes or no to the question “Do you favour allowing women to be priests?” |
times.df <- read.csv("datasets/times-poll.csv")
# Tells R that we want to organise the responses when summarising
# as "Yes" then "No"
times.df$Answer <- factor(times.df$Answer, levels = c("Yes", "No"))
# Summarise the data in terms of the sample proportions
xtabs( ~ Answer, data = times.df) |>
  proportions()
Answer
      Yes        No
0.5897436 0.4102564
Independence
It has been met as the survey randomly selected adult American Catholics
Heuristic 1
It is quite clear from the bar plot that \(n \times \hat{p} \geq 10\) and \(n \times (1 - \hat{p}) \geq 10\). So we can construct a confidence interval for \(p\)
Heuristic 2
If we test \(p_0 = 0.5\), then \(n \times p_0 \geq 10\) and \(n \times (1 - p_0) \geq 10\) are true statements. So we can conduct a hypothesis test for \(p\) as well
# Review T01: Exploring Data
xtabs(~ Answer, data = times.df) |>
  as.data.frame() |>
  barchart(Freq ~ Answer, data = _, origin = 0,
           main = "Distribution of survey responses",
           xlab = "Do you favour allowing women to be priests?",
           ylab = "Frequency")
# We can make use of R's pipe operator to "forward" the one-way table of counts
# that summarises the data file
xtabs(~ Answer, data = times.df) |>
  prop.test(correct = FALSE, p = 0.5)
1-sample proportions test without continuity correction
data: xtabs(~Answer, data = times.df), null probability 0.5
X-squared = 16.333, df = 1, p-value = 5.312e-05
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5464089 0.6317285
sample estimates:
p
0.5897436
95% CI for \(p\)
With 95% confidence, we estimate that the underlying proportion of all adult American Catholics who were in favour of allowing women to be priests is somewhere between 54.6 and 63.2 percent
Hypothesis Test for \(p = 0.5\)
At the 5% level of significance, we reject the null that the underlying proportion of all American Catholics who were in favour of allowing women to be priests is 50 percent, in favour of the alternative that it is not
(p-value ≈ 0.00005)